
    A case for asymmetric-cell cache memories


    NOC-Out: Microarchitecting a Scale-Out Processor

    Scale-out server workloads benefit from many-core processor organizations that enable high throughput thanks to abundant request-level parallelism. A key characteristic of these workloads is a large instruction footprint that exceeds the capacity of private caches. While a shared last-level cache (LLC) can capture the instruction working set, it necessitates a low-latency interconnect fabric to minimize core stall time on instruction fetches serviced by the LLC. Many-core processors with a mesh interconnect sacrifice performance on scale-out workloads due to NOC-induced delays. Low-diameter topologies can overcome the performance limitations of meshes through rich inter-node connectivity, but at a high area expense. To address the drawbacks of existing designs, this work introduces NOC-Out – a many-core processor organization that affords low LLC access delays at a small area cost. NOC-Out is tuned to accommodate the bilateral core-to-cache access pattern, characterized by minimal coherence activity and a lack of inter-core communication, that is dominant in scale-out workloads. Optimizing for the bilateral access pattern, NOC-Out segregates cores and LLC banks into distinct network regions and reduces costly network connectivity by eliminating the majority of inter-core links. NOC-Out further simplifies the interconnect through the use of low-complexity tree-based topologies. A detailed evaluation targeting a 64-core CMP and a set of scale-out workloads reveals that NOC-Out improves system performance by 17% and reduces network area by 28% over a tiled mesh-based design. Compared to a design with a richly-connected flattened butterfly topology, NOC-Out reduces network area by 9x while matching its performance.

    SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling

    Current software-based microarchitecture simulators are many orders of magnitude slower than the hardware they simulate. Hence, most microarchitecture design studies draw their conclusions from drastically truncated benchmark simulations that are often inaccurate and misleading. We present the sampling microarchitecture simulation (SMARTS) framework as an approach to enable fast and accurate performance measurements of full-length benchmarks. SMARTS accelerates simulation by selectively measuring in detail only an appropriate benchmark subset. SMARTS prescribes a statistically sound procedure for configuring a systematic sampling simulation run to achieve a desired quantifiable confidence in estimates. Analysis of 41 of the 45 possible SPEC2K benchmark/input combinations shows that CPI and energy per instruction (EPI) can be estimated to within 3% with 99.7% confidence by measuring fewer than 50 million instructions per benchmark. In practice, inaccuracy in microarchitectural state initialization introduces an additional uncertainty, which we empirically bound to ~2% for the tested benchmarks. Our implementation of SMARTS achieves an actual average error of only 0.64% on CPI and 0.59% on EPI for the tested benchmarks, running with average speedups of 35 and 60 over detailed simulation of 8-way and 16-way out-of-order processors, respectively.
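The sampling procedure the abstract alludes to can be sketched with standard survey-sampling math. This is an illustrative approximation, not the authors' code: the function names, the example coefficient of variation, and the unit count are all assumptions.

```python
import math

def required_sample_size(cv, error=0.03, z=3.0):
    """Number of sampling units needed to estimate a mean (e.g., CPI)
    to within +/- `error` relative error. z = 3.0 corresponds to
    ~99.7% confidence for an approximately normal estimator."""
    return math.ceil((z * cv / error) ** 2)

def systematic_sample(units, n):
    """Systematic sampling: pick n units at a fixed stride across the
    full benchmark, rather than one contiguous truncated slice."""
    stride = max(1, len(units) // n)
    return units[::stride][:n]

# Example: per-unit CPI with a coefficient of variation of 1.5
# (a hypothetical value) needs (3 * 1.5 / 0.03)^2 = 22,500 units.
n = required_sample_size(1.5)
sample = systematic_sample(list(range(1_000_000)), n)
```

The key point mirrored here is that the sample size depends on the measured variability (CV) and the desired confidence, not on benchmark length, which is why a few tens of millions of measured instructions can suffice for a full-length run.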

    Fat Caches for Scale-Out Servers

    The authors propose a high-capacity cache architecture that leverages emerging high-bandwidth memory modules. High-capacity caches capture the secondary data working sets of scale-out workloads while uncovering significant spatiotemporal locality across data objects. Unlike state-of-the-art DRAM caches employing in-memory block-level metadata, the proposed cache is organized in pages, enabling a practical tag array that can be implemented in the logic die of the high-bandwidth memory modules.

    FADE: A programmable filtering accelerator for instruction-grain monitoring

    Instruction-grain monitoring is a powerful approach that enables a wide spectrum of bug-finding tools. As existing software approaches incur prohibitive runtime overhead, researchers have focused on hardware support for instruction-grain monitoring. A recurring theme in recent work is the use of hardware-assisted filtering to elide costly software analysis. This work generalizes and extends prior point solutions into a programmable filtering accelerator affording vast flexibility and at-speed event filtering. The pipelined microarchitecture of the accelerator affords a peak filtering rate of one application event per cycle, which suffices to keep up with an aggressive OoO core running the monitored application. A unique feature of the proposed design is the ability to dynamically resolve dependencies between unfilterable events and subsequent events, eliminating data-dependent stalls and maximizing the accelerator’s performance. Our evaluation results show a monitoring slowdown of just 1.2-1.8x across a diverse set of monitoring tools.

    CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers

    Manycore chips are emerging as the architecture of choice to provide power efficiency and improve performance, while riding Moore’s Law. In these architectures, on-chip interconnects play a pivotal role in ensuring power and performance scalability. As supply voltages begin to level off in future technologies, chip designs in general and interconnects in particular will require specialization to meet power and performance objectives. In this work, we make the observation that cache-coherent manycore server chips exhibit a duality in on-chip network traffic. Request traffic largely consists of simple control messages, while response traffic often carries cache-block-sized payloads. We present Cache-Coherence Network-on-Chip (CCNoC), a design that specializes the NoC to fit the demands of server workloads via a pair of asymmetric networks tuned to the type of traffic traversing them. The networks differ in their datapath width, router microarchitecture, flow control strategy, and delay. The resulting heterogeneous CCNoC architecture enables significant gains in power efficiency over conventional NoC designs at similar performance levels. Our evaluation reveals that a 4x4 mesh-based chip multiprocessor with the proposed CCNoC organization running commercial server workloads is 15-28% more energy efficient than various state-of-the-art single- and dual-network organizations.
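The request/response duality can be illustrated with a toy steering model: narrow control messages go to a skinny network, block-sized payloads to a wide one. The flit widths, message sizes, and function names below are assumptions for illustration, not figures from the paper.

```python
REQ_FLIT_BYTES = 16    # narrow request network (assumed width)
RESP_FLIT_BYTES = 64   # wide response network (assumed width)

def flits_needed(payload_bytes, flit_bytes):
    """Ceiling division: number of flits to carry a payload."""
    return max(1, -(-payload_bytes // flit_bytes))

def route(msg_type, payload_bytes):
    """Steer control messages to the request network and
    cache-block-sized data to the response network."""
    if msg_type == "request":
        return ("request-net", flits_needed(payload_bytes, REQ_FLIT_BYTES))
    return ("response-net", flits_needed(payload_bytes, RESP_FLIT_BYTES))

# A read-miss request (~8B header) fits in one narrow flit;
# a 64B cache block fits in one wide flit.
route("request", 8)      # -> ("request-net", 1)
route("response", 64)    # -> ("response-net", 1)
```

The design point this mirrors is that sizing each network to its dominant message class avoids paying wide-datapath power for traffic that never needs it.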

    Targeting the Pseudomonas aeruginosa Virulence Factor Phospholipase C With Engineered Liposomes.

    Engineered liposomes composed of the naturally occurring lipids sphingomyelin (Sm) and cholesterol (Ch) have been demonstrated to efficiently neutralize toxins secreted by Gram-positive bacteria such as Streptococcus pneumoniae and Staphylococcus aureus. Here, we hypothesized that liposomes are capable of neutralizing cytolytic virulence factors secreted by the Gram-negative pathogen Pseudomonas aeruginosa. We used the highly virulent cystic fibrosis P. aeruginosa Liverpool Epidemic Strain LESB58 and showed that sphingomyelin (Sm) liposomes and liposomes combining sphingomyelin with cholesterol (Ch:Sm; 66 mol% Ch and 34 mol% Sm) reduced lysis of human bronchial and red blood cells upon challenge with the Pseudomonas secretome. Mass spectrometry of liposome-sequestered Pseudomonas proteins identified the virulence-promoting hemolytic phospholipase C (PlcH) as having been neutralized. Pseudomonas aeruginosa supernatants incubated with liposomes demonstrated reduced PlcH activity as assessed by the p-nitrophenylphosphorylcholine (NPPC) assay. Testing the in vivo efficacy of the liposomes in a murine cutaneous abscess model revealed that Sm and Ch:Sm, as single-dose treatments, attenuated abscesses by >30%, demonstrating a similar effect to that of a mutant lacking plcH in this infection model. Thus, sphingomyelin-containing liposome therapy offers an interesting approach to treat and reduce virulence of complex infections caused by P. aeruginosa and potentially other Gram-negative pathogens expressing PlcH.

    Resolve: Enabling Accurate Parallel Monitoring under Relaxed Memory Models

    Hardware-assisted instruction-grain monitoring frameworks provide high-coverage, low-overhead debugging support for parallel programs. Unfortunately, existing frameworks are ill-suited for the relaxed memory models employed by nearly all modern processor architectures—e.g., TSO (x86, SPARC), RMO (SPARC), and Weak Consistency (ARMv7). For TSO, prior proposals hint at a solution, but provide no implementation or evaluation, and fail to correctly handle important corner cases such as byte-level dependences. For more relaxed memory models such as RMO and Weak Consistency, prior frameworks deadlock, rendering them unable to detect any bugs past the first deadlock! This paper presents Resolve, the first hardware-assisted instruction-grain monitoring framework that is complete, correct and deadlock-free under relaxed memory models. Resolve is based on the observation that while relaxed memory models can produce cycles of dependences that deadlock prior approaches, these cycles can be overcome by consulting the dataflow graph of the application threads being monitored, instead of their program order. Resolve handles all possible cycles arising in relaxed memory models through a careful approach that uses both dataflow-based processing and versioning of monitoring state, as appropriate. Moreover, we provide the first quantitative characterization of the cycles arising under RMO, demonstrating that such cycles are prevalent and persistent, and hence deadlock is a real problem that must be addressed. Yet they are neither frequent nor complex enough to be costly to handle, so Resolve’s overheads are negligible. Finally, we present a simple and novel hardware mechanism for properly synchronizing updates to monitoring state under relaxed memory models, improving performance by up to 35% over the judicious use of memory fences.

    SimFlex: Statistical Sampling of Computer System Simulation

    Timing-accurate full-system multiprocessor simulations can take years because of architecture and application complexity. Statistical sampling makes simulation-based studies feasible by providing ten-thousand-fold reductions in simulation runtime and enabling thousand-way simulation parallelism.

    BugSifter: A Generalized Accelerator for Flexible Instruction-Grain Monitoring

    Software robustness is an ever-challenging problem in the face of continually evolving software and hardware. Instruction-grain monitoring is a powerful approach for improved software robustness that affords comprehensive runtime coverage for a wide spectrum of bugs and security exploits. Unfortunately, existing instruction-grain monitoring frameworks, such as dynamic binary instrumentation, are either prohibitively expensive (slowing down applications by an order of magnitude or more) or offer limited coverage. This work introduces BugSifter, a new design that drastically decreases monitoring overhead without sacrificing flexibility or bug coverage. The main overhead of instruction-grain monitoring lies in the execution of software event handlers that monitor nearly every application instruction to check for bugs. BugSifter identifies common monitoring activities that result in redundant monitoring actions and filters them using general, light-weight hardware, eliminating the majority of costly software event handlers. Our proposed design filters 80-98% of events while monitoring for a variety of commonly-occurring bugs, delegating the rest to flexible software handlers. BugSifter significantly reduces the overhead of instruction-grain monitoring to an average of 40% over unmonitored application time. BugSifter makes instruction-grain monitoring practical, enabling efficient and timely detection of a wide range of bugs, thus making software more robust.
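The filtering idea can be illustrated with a toy software model of redundant-event elision; this is a hypothetical sketch, not BugSifter's actual mechanism, and the known-clean-address heuristic and all names are assumptions. The example uses a use-before-initialization check as the monitored bug class.

```python
class EventFilter:
    """Toy model of a hardware event filter: events whose outcome is
    already known are filtered; only uncertain events reach the
    (expensive) software handler."""

    def __init__(self):
        self.known_clean = set()   # addresses proven initialized
        self.filtered = 0
        self.delegated = 0

    def on_store(self, addr):
        # A store initializes the location, so later loads from it
        # can never trigger a use-before-init report: filterable.
        self.known_clean.add(addr)
        self.filtered += 1

    def on_load(self, addr, software_handler):
        if addr in self.known_clean:
            self.filtered += 1          # redundant check, elided
        else:
            self.delegated += 1         # possible bug: run the handler
            software_handler(addr)

f = EventFilter()
f.on_store(0x1000)
f.on_load(0x1000, lambda a: None)   # filtered: provably clean
f.on_load(0x2000, lambda a: None)   # delegated to software
```

In this toy run, two of three events are filtered; the paper's 80-98% filter rates reflect the same structure at instruction granularity, with the filter implemented in hardware rather than a Python set.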